104 research outputs found

    Automated group assignment in large phylogenetic trees using GRUNT: GRouping, Ungrouping, Naming Tool

    Background: Accurate taxonomy is best maintained if species are arranged as hierarchical groups in phylogenetic trees. This is especially important as trees grow larger as a consequence of a rapidly expanding sequence database. Hierarchical group names are typically manually assigned in trees, an approach that becomes unfeasible for very large topologies. Results: We have developed an automated iterative procedure for delineating stable (monophyletic) hierarchical groups in large (or small) trees and naming those groups according to a set of sequentially applied rules. In addition, we have created an associated ungrouping tool for removing existing groups that do not meet user-defined criteria (such as monophyly). The procedure is implemented in a program called GRUNT (GRouping, Ungrouping, Naming Tool) and has been applied to the current release of the Greengenes (Hugenholtz) 16S rRNA gene taxonomy comprising more than 130,000 taxa. Conclusion: GRUNT will facilitate researchers requiring comprehensive hierarchical grouping of large tree topologies in, for example, database curation, microarray design and pangenome assignments. The application is available at the Greengenes website [1].

    An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea

    Reference phylogenies are crucial for providing a taxonomic framework for interpretation of marker gene and metagenomic surveys, which continue to reveal novel species at a remarkable rate. Greengenes is a dedicated full-length 16S rRNA gene database that provides users with a curated taxonomy based on de novo tree inference. We developed a 'taxonomy to tree' approach for transferring group names from an existing taxonomy to a tree topology, and used it to apply the Greengenes, National Center for Biotechnology Information (NCBI) and cyanoDB (Cyanobacteria only) taxonomies to a de novo tree comprising 408 315 sequences. We also incorporated explicit rank information provided by the NCBI taxonomy to group names (by prefixing rank designations) for better user orientation and classification consistency. The resulting merged taxonomy improved the classification of 75% of the sequences by one or more ranks relative to the original NCBI taxonomy with the most pronounced improvements occurring in under-classified environmental sequences. We also assessed candidate phyla (divisions) currently defined by NCBI and present recommendations for consolidation of 34 redundantly named groups. All intermediate results from the pipeline, which includes tree inference, jackknifing and transfer of a donor taxonomy to a recipient tree (tax2tree) are available for download. The improved Greengenes taxonomy should provide important infrastructure for a wide range of megasequencing projects studying ecosystems on scales ranging from our own bodies (the Human Microbiome Project) to the entire planet (the Earth Microbiome Project). The implementation of the software can be obtained from http://sourceforge.net/projects/tax2tree/

    Simrank: Rapid and sensitive general-purpose k-mer search tool

    Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project (http://nihroadmap.nih.gov/hmp). Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. Here we present a stand-alone utility, Simrank, which allows users to rapidly identify the database strings most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-language data found Simrank 10X to 928X faster, depending on the dataset. Simrank provides molecular ecologists with a high-throughput, open-source choice for comparing large sequence sets to find similarity.
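
As a rough illustration of the k-mer matching idea mentioned in this abstract, the sketch below ranks database strings by the fraction of a query's distinct k-mers that each one contains. It is not Simrank's implementation: the function names, the scoring rule, the choice of k and the toy sequences are all assumptions made for this example.

```python
def kmers(s, k):
    """Return the set of distinct length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def rank_by_shared_kmers(query, database, k=3):
    """Rank database entries by the fraction of the query's distinct
    k-mers that also occur in each entry (hypothetical scoring rule)."""
    q = kmers(query, k)
    if not q:
        # Query shorter than k: nothing to compare.
        return []
    scores = []
    for name, seq in database.items():
        shared = len(q & kmers(seq, k))
        scores.append((name, shared / len(q)))
    # Highest similarity first.
    return sorted(scores, key=lambda t: t[1], reverse=True)


# Toy database; the names and sequences are invented for illustration.
database = {
    "seq_a": "ACGTACGTGA",
    "seq_b": "TTTTGGGGCC",
    "seq_c": "ACGTTTTTGA",
}
hits = rank_by_shared_kmers("ACGTACG", database, k=3)
for name, score in hits:
    print(name, score)
```

Using sets of k-mers keeps each comparison to a cheap set intersection, which is the general reason k-mer screens are used as a fast pre-filter before slower alignment.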

    Piphillin predicts metagenomic composition and dynamics from DADA2-corrected 16S rDNA sequences

    Shotgun metagenomic sequencing reveals the potential in microbial communities. However, lower-cost 16S ribosomal RNA (rRNA) gene sequencing provides taxonomic, not functional, observations. To remedy this, we previously introduced Piphillin, a software package that predicts functional metagenomic content based on the frequency of detected 16S rRNA gene sequences corresponding to genomes in regularly updated, functionally annotated genome databases. Piphillin (and similar tools) have previously been evaluated on 16S rRNA data processed by the clustering of sequences into operational taxonomic units (OTUs). New techniques such as amplicon sequence variant error correction are in increased use, but it is unknown if these techniques perform better in metagenomic content prediction pipelines, or if they should be treated the same as OTU data with respect to optimal pipeline parameters.

    Foregut microbiome in development of esophageal adenocarcinoma

    Esophageal adenocarcinoma (EA), the type of cancer linked to heartburn due to gastroesophageal reflux disease (GERD), has increased sixfold in the past 30 years. This cannot currently be explained by the usual environmental or by host genetic factors. EA is the end result of a sequence of GERD-related diseases, preceded by reflux esophagitis (RE) and Barrett’s esophagus (BE). Preliminary studies by Pei and colleagues at NYU on elderly male veterans identified two types of microbiotas in the esophagus. Patients who carry the type II microbiota are >15-fold more likely to have esophagitis and BE than those harboring the type I microbiota. In a small-scale study, we also found that 3 of 3 cases of EA harbored the type II biota. The findings have opened a new approach to understanding the recent surge in the incidence of EA.
Our long-term goal is to identify the cause of GERD sequence. The hypothesis to be tested is that changes in the foregut microbiome are associated with EA and its precursors, RE and BE in GERD sequence. We will conduct a case control study to demonstrate the microbiome disease association in every stage of GERD sequence, as well as analyze the trend in changes in the microbiome along disease progression toward EA, by two specific aims. Aim 1 is to conduct a comprehensive population survey of the foregut microbiome and demonstrate its association with GERD sequence. Furthermore, spatial relationship between the esophageal microbiota and upstream (mouth) and downstream (stomach) foregut microbiotas as well as temporal stability of the microbiome-disease association will also be examined. Aim 2 is to define the distal esophageal metagenome and demonstrate its association with GERD sequence. Detailed analyses will include pathway-disease and gene-disease associations. Archaea, fungi and viruses, if identified, also will be correlated with the diseases.
A significant association between the foregut microbiome and GERD sequence, if demonstrated, will be the first step for eventually testing whether an abnormal microbiome is required for the development of the sequence of phenotypic changes toward EA. If EA and its precursors represent a microecological disease, treating the cause of GERD might become possible, for example, by normalizing the microbiota through use of antibiotics, probiotics, or prebiotics. Causative therapy of GERD could prevent its progression and reverse the current trend of increasing incidence of EA.

    Strain level and comprehensive microbiome analysis in inflammatory bowel disease via multi-technology meta-analysis identifies key bacterial influencers of disease

    Inflammatory bowel disease (IBD) is a heterogenous disease in which the microbiome has been shown to play an important role. However, the precise homeostatic or pathological functions played by bacteria remain unclear. Most published studies report taxa-disease associations based on single-technology analysis of a single cohort, potentially biasing results to one clinical protocol, cohort, and molecular analysis technology. To begin to address this key question, precise identification of the bacteria implicated in IBD across cohorts is necessary. We sought to take advantage of the numerous and diverse studies characterizing the microbiome in IBD to develop a multi-technology meta-analysis (MTMA) as a platform for aggregation of independently generated datasets, irrespective of DNA-profiling technique, in order to uncover the consistent microbial modulators of disease. We report the largest strain-level survey of IBD, integrating microbiome profiles from 3,407 samples from 21 datasets spanning 15 cohorts, three of which are presented for the first time in the current study, characterized using three DNA-profiling technologies, mapping all nucleotide data against known, culturable strain reference data. We identify several novel IBD associations with culturable strains that have so far remained elusive, including two genome-sequenced but uncharacterized Lachnospiraceae strains consistently decreased in both the gut luminal and mucosal contents of patients with IBD, and demonstrate that these strains are correlated with inflammation-related pathways that are known mechanisms targeted for treatment. Furthermore, comparative MTMA at the species versus strain level reveals that not all significant strain associations resulted in a corresponding species-level significance and conversely significant species associations are not always re-captured at the strain level. 
We propose MTMA for uncovering experimentally testable strain-disease associations that, as demonstrated here, are beneficial in discovering mechanisms underpinning microbiome impact on disease or novel targets for therapeutic interventions

    Characterization of Coastal Urban Watershed Bacterial Communities Leads to Alternative Community-Based Indicators

    BACKGROUND: Microbial communities in aquatic environments are spatially and temporally dynamic due to environmental fluctuations and varied external input sources. A large percentage of the urban watersheds in the United States are affected by fecal pollution, including human pathogens, thus warranting comprehensive monitoring. METHODOLOGY/PRINCIPAL FINDINGS: Using a high-density microarray (PhyloChip), we examined water column bacterial community DNA extracted from two connecting urban watersheds, elucidating variable and stable bacterial subpopulations over a 3-day period and community composition profiles that were distinct to fecal and non-fecal sources. Two approaches were used for indication of fecal influence. The first approach utilized similarity of 503 operational taxonomic units (OTUs) common to all fecal samples analyzed in this study with the watershed samples as an index of fecal pollution. A majority of the 503 OTUs were found in the phyla Firmicutes, Proteobacteria, Bacteroidetes, and Actinobacteria. The second approach incorporated relative richness of 4 bacterial classes (Bacilli, Bacteroidetes, Clostridia and alpha-proteobacteria) found to have the highest variance in fecal and non-fecal samples. The ratio of these 4 classes (BBC:A) from the watershed samples demonstrated a trend where bacterial communities from gut and sewage sources had higher ratios than from sources not impacted by fecal material. This trend was also observed in the 124 bacterial communities from previously published and unpublished sequencing or PhyloChip-analyzed studies. CONCLUSIONS/SIGNIFICANCE: This study provided a detailed characterization of bacterial community variability during dry weather across a 3-day period in two urban watersheds. The comparative analysis of watershed community composition resulted in alternative community-based indicators that could be useful for assessing ecosystem health.